Library loading

library(Seurat)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, were retired in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## Attaching SeuratObject
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(omnideconv)
## Loading required package: snowfall
## Loading required package: snow
## Skipping python installation checks ...
library(readxl)

Single cell data processing

First of all, we will need to process the single cell dataset in order to remove low quality cells. We will create a Seurat object with the cell counts and their metadata of interest curated by the authors, which include cell type annotation on three levels of resolution

single.cell.data <- Seurat::ReadMtx(
  mtx = './Wu_etal_2021_BRCA_scRNASeq/count_matrix_sparse.mtx',
  cells = './Wu_etal_2021_BRCA_scRNASeq/count_matrix_barcodes.tsv',
  features = './Wu_etal_2021_BRCA_scRNASeq/count_matrix_genes.tsv',
  feature.column = 1
)
single.cell.metadata <- read.table('./Wu_etal_2021_BRCA_scRNASeq/metadata.csv', 
                                   sep = ',',
                                   header = TRUE, 
                                   row.names = 1)

single.cell.data = CreateSeuratObject(single.cell.data, 
                                      project='Wu_dataset', 
                                      assay='RNA', 
                                      min.cells = 0, 
                                      min.features = 1, meta.data = single.cell.metadata)

We can have an overview of the number of cells per cell type in the dataset:

table(single.cell.data$celltype_major, dnn = list("celltype_major")) %>%
      as.data.frame(., responseName = "number_cells")
##      celltype_major number_cells
## 1           B-cells         3206
## 2              CAFs         6573
## 3 Cancer Epithelial        24489
## 4       Endothelial         7605
## 5           Myeloid         9675
## 6 Normal Epithelial         4355
## 7      Plasmablasts         3524
## 8               PVL         5423
## 9           T-cells        35214

In order to remove low quality cells, we will follow the best practices for single cell normalization. We will perform the quality control on each cell considering as metrics the number of counts, the number of features (genes) and the fraction of mitochondrial genes. The metric considered is the MAD, (Mean Absolute Deviation), computed as \(MAD = median(|X_i - median(X)|)\) with \(X_i\) being the respective metric of an observation (cell). We will consider as outliers cells that have \(X_i < n*|MAD|\), with \(n=5\) for number of counts and number of features, and n=3 for the fraction of mitochondrial genes.

is_outlier <- function(SeuratObject, metric, nmads){
  eval(parse(text = paste0("M <- SeuratObject$",metric)))
  outlier <- (M < median(M) - nmads * mad(M)) | (M > median(M) + nmads * mad(M))
  return(outlier)
}

check_outliers_nFeature <- is_outlier(single.cell.data, 'nFeature_RNA', 5)
check_outliers_nCount <- is_outlier(single.cell.data, 'nCount_RNA', 5)
check_outliers_mito <- is_outlier(single.cell.data, 'percent.mito', 3)

non_outliers_nFeature <- names(check_outliers_nFeature[!check_outliers_nFeature])
non_outliers_nCount <- names(check_outliers_nCount[!check_outliers_nCount])
non_outliers_mito <- names(check_outliers_mito[!check_outliers_mito])

We will retain only those that satisfy all three of the conditions described above.

non_outliers <- intersect(non_outliers_nFeature, non_outliers_nCount) %>% 
  intersect(non_outliers_mito)

single.cell.data <- subset(single.cell.data, cells = non_outliers)

as.data.frame(table(single.cell.data$celltype_major, dnn = list("celltype_major")), responseName = "number_cells")
##      celltype_major number_cells
## 1           B-cells         3150
## 2              CAFs         6154
## 3 Cancer Epithelial        16613
## 4       Endothelial         6991
## 5           Myeloid         8855
## 6 Normal Epithelial         3533
## 7      Plasmablasts         3164
## 8               PVL         5120
## 9           T-cells        34991

Bulk data preprocessing

We will now read in the bulk sequencing data file, which consists of 24 samples.

bulk.data <- read.table('./GSE176078_Wu_etal_2021_bulkRNAseq_raw_counts.txt', skip=1)

header <- read.table('./GSE176078_Wu_etal_2021_bulkRNAseq_raw_counts.txt', 
                     header = FALSE, nrows = 1, skipNul = TRUE, sep='\t')

colnames(bulk.data) <- c('Genes', gsub('A|N', '', header[2:25]))

bulk.data <- bulk.data[bulk.data$Genes != '', ]
bulk.data <- column_to_rownames(bulk.data, 'Genes')
bulk.data <- as.matrix(bulk.data)

Subsampling of single cell data

The various methods included in omnideconv rely on the single cell dataset that will be used to train them for the deconvolution of those specific cell types. This training involves the optimization of internal features of the methods and can happen in different ways. Some methods use the single cell data to build a ‘signature matrix’, i.e. a reduced transcriptional fingerprints of the cell types provided, while others use this data in a statistical or deep learning model. Since single cell datasets can often encompass thousands of cells, we will need to subsample it in order to be able to run the analysis on our machines. In this case we will retain a maximum of 200 cells per cell type, but this step can be costumed, or eventually skipped, depending on the computational resources available.

max_cells_per_celltype = 200

sampled.metadata <- single.cell.data@meta.data %>%
      rownames_to_column(., 'barcode') %>%
      group_by(., celltype_major) %>% 
      nest() %>%            
      mutate(n =  map_dbl(data, nrow)) %>%
      mutate(n = min(n, max_cells_per_celltype)) %>%
      ungroup() %>% 
      mutate(samp = map2(data, n, sample_n)) %>% 
      select(-data) %>%
      unnest(samp)

single.cell.data.sampled <- subset(single.cell.data, cells = sampled.metadata$barcode)


as.data.frame(table(single.cell.data.sampled$celltype_major, dnn = list("celltype_major")), responseName = "number_cells")
##      celltype_major number_cells
## 1           B-cells          200
## 2              CAFs          200
## 3 Cancer Epithelial          200
## 4       Endothelial          200
## 5           Myeloid          200
## 6 Normal Epithelial          200
## 7      Plasmablasts          200
## 8               PVL          200
## 9           T-cells          200

Deconvolution of the bulk data

Each methods has different requirements, but in general to compute the deconvolution results we will need the single cell counts matrix, the cell type annotations and the information on the individual/experiment fom which the cells were retrieved (batch ID).

counts.matrix <- as.matrix(single.cell.data.sampled@assays$RNA@counts)
cell.type.annotations <- single.cell.data.sampled$celltype_major
batch.ids <- single.cell.data.sampled$orig.ident

Deconvolution with DWLS

Now we’re going to deconvolute the bulk dataset with different methods. The first one we are going to use is called DWLS and performs the deconvolution in a two-steps process. First, the single cell data is used to build a signature matrix using the omnideconv function build_model:

# We need to insert the normalization as well
signature.matrix.dwls <- omnideconv::build_model(single_cell_object = counts.matrix,
                        cell_type_annotations = cell.type.annotations,
                        method = 'dwls', 
                        dwls_method = 'mast_optimized')

This signature is optimized so that the genes selected are the ones that help to discriminate across cell types, and it can be manually inspected.

head(signature.matrix.dwls)
##       Endothelial CAFs  PVL B-cells T-cells Myeloid Normal Epithelial
## PLVAP        8.60 0.06 0.05    0.01    0.02    0.05              0.18
## ACKR1        7.83 0.07 0.03    0.00    0.00    0.03              0.18
## RAMP2        8.22 0.14 0.03    0.01    0.00    0.01              0.15
## CLDN5        4.89 0.02 0.03    0.00    0.00    0.01              0.06
## VWF          4.26 0.01 0.04    0.00    0.01    0.00              0.03
## RBP7         5.05 0.01 0.00    0.00    0.00    0.06              0.19
##       Plasmablasts Cancer Epithelial
## PLVAP         0.00              0.02
## ACKR1         0.09              0.02
## RAMP2         0.01              0.19
## CLDN5         0.02              0.00
## VWF           0.01              0.02
## RBP7          0.01              0.06

The signature is now used for the deconvolution of the bulk RNAseq, which is performed with the omnideconv function deconvolute.

deconvolution.results.dwls <- deconvolute(bulk_gene_expression = bulk.data,
                                          signature = signature.matrix.dwls,
                                          method='dwls',
                                          dwls_submethod = 'DampenedWLS')

We will now obtain, for every sample, a set of cell type fractions for each cell type that was included in the provided single cell dataset.

head(deconvolution.results.dwls)
##              B-cells      CAFs Cancer Epithelial Endothelial     Myeloid
## CID3586 2.311927e-01 0.2384089      1.704432e-01  0.08261089 0.004685998
## CID3921 2.635981e-02 0.1769145      1.446501e-01  0.11398995 0.019363855
## CID3941 2.520341e-10 0.2673059      3.726856e-01  0.09053598 0.001668845
## CID3948 4.661399e-01 0.1069023      2.642584e-01  0.04874744 0.007327080
## CID3963 1.120733e-01 0.2083548      2.827959e-08  0.08767865 0.103464203
## CID4066 4.913408e-02 0.3417252      1.736733e-01  0.10236788 0.018905468
##         Normal Epithelial  Plasmablasts          PVL    T-cells
## CID3586      3.548583e-02  8.384038e-26 2.181795e-02 0.21535458
## CID3921      6.807888e-02  3.844282e-03 1.470788e-19 0.44679865
## CID3941      1.795141e-01  2.554558e-14 2.965864e-02 0.05863092
## CID3948      5.598139e-14  6.855057e-04 2.001737e-02 0.08592200
## CID3963      3.451726e-01 -3.993464e-22 2.044198e-02 0.12281444
## CID4066      1.546639e-01 -6.602120e-21 4.873661e-02 0.11079364

We can also visualise the results as a barplot trough the built-in plot_deconvolution function

omnideconv::plot_deconvolution(list('dwls' = deconvolution.results.dwls), "bar", "method", "Spectral")
omnideconv::plot_deconvolution(list('dwls' = deconvolution.results.dwls.minor), "bar", "method", "Spectral")

BayesPrism deconvolution

The third method we will use is BayesPrism. This method is based on a Bayesian framework and models the transcriptomic expression observed in the scRNA-seq dataset. It then uses this information to dissect te bulk RNA-seq.

# BayesPrism deconvolution

deconvolution.results.bayesprism <- deconvolute(bulk_gene_expression = bulk.data,
                                           single_cell_object = counts.matrix,
                                           cell_type_annotations = cell.type.annotations,
                                           signature=NULL,
                                           method = 'bayesprism', 
                                           n_cores=12)

We can visualize the results as before:

omnideconv::plot_deconvolution(list('bayesprism' = deconvolution.results.bayesprism), "bar", "method", "Spectral")

Deconvolution of cell types at a lower resolution

The considered single-cell breast cancer dataset includes cell-type annotations at three levels of resolution: celltype_major, celltype_minor, and celltype_subset, which distinguish 9, 29, and 58 cell types respectively. The different cell-type annotations can be accessed with:

single.cell.data$celltype_major   # Major annotation
single.cell.data$celltype_minor   # Minor annotation
single.cell.data$celltype_subset           # Subset annotation

These additional annotations provide a cell-type classification at a finer resolution. For instance, at the celltype_major level, we only have the T cell population, while at the celltype_minor level, we can distinguish between CD4+ and CD8+ T cells. In the following, we will again perform deconvolution analysis with DWLS but, this time, using the celltype_minor information. We will subsample the dataset as before, this time considering the second level of resolution for the cell types, and extract the objects needed for deconvolution.

max_cells_per_celltype = 200


sampled.metadata <- single.cell.data@meta.data %>%
      rownames_to_column(., 'barcode') %>%
      group_by(., celltype_minor) %>% 
      nest() %>%            
      mutate(n =  map_dbl(data, nrow)) %>%
      mutate(n = min(n, max_cells_per_celltype)) %>%
      ungroup() %>% 
      mutate(samp = map2(data, n, sample_n)) %>% 
      select(-data) %>%
      unnest(samp)

single.cell.data.sampled <- subset(single.cell.data, cells = sampled.metadata$barcode)


as.data.frame(table(single.cell.data.sampled$celltype_minor, dnn = list("celltype_minor")), responseName = "number_cells")
##                 celltype_minor number_cells
## 1               B cells Memory          200
## 2                B cells Naive          200
## 3           CAFs MSC iCAF-like          200
## 4              CAFs myCAF-like          200
## 5              Cancer Basal SC          200
## 6               Cancer Cycling          200
## 7               Cancer Her2 SC          200
## 8               Cancer LumA SC          200
## 9               Cancer LumB SC          200
## 10                 Cycling PVL           37
## 11             Cycling T-cells          200
## 12             Cycling_Myeloid          200
## 13                         DCs          200
## 14           Endothelial ACKR1          200
## 15          Endothelial CXCL12          200
## 16 Endothelial Lymphatic LYVE1          183
## 17            Endothelial RGS5          200
## 18         Luminal Progenitors          200
## 19                  Macrophage          200
## 20              Mature Luminal          200
## 21                    Monocyte          200
## 22               Myoepithelial          200
## 23                    NK cells          200
## 24                   NKT cells          200
## 25                Plasmablasts          200
## 26          PVL Differentiated          200
## 27                PVL Immature          200
## 28                T cells CD4+          200
## 29                T cells CD8+          200
counts.matrix <- as.matrix(single.cell.data@assays$RNA@counts)
cell.type.annotations <- single.cell.data$celltype_minor
batch.ids <- single.cell.data$orig.ident
signature.matrix.dwls.minor <- omnideconv::build_model(single_cell_object = counts.matrix,
                        cell_type_annotations = cell.type.annotations,
                        method = 'dwls', 
                        dwls_method = 'mast_optimized')

deconvolution.results.dwls.minor <- deconvolute(bulk_gene_expression = bulk.data,
                                          signature = signature.matrix.dwls.minor,
                                          method='dwls',
                                          dwls_submethod = 'DampenedWLS')

We can visualize the results as before:

omnideconv::plot_deconvolution(list('dwls' = deconvolution.results.dwls.minor), "bar", "method", "Spectral")

Comparison of cell fractions across conditions

We can consider as well the metadata provided with the original paper, which include patient’s data, cancer subtype information and treatment details. We can first harmonize the sample names, and then combine all this information with the deconvolution results in one dataframe.

patient.metadata <- read_excel("./41588_2021_911_MOESM4_ESM.xlsx", sheet = 1, skip = 3) %>%
  select(., c(1, 4, 5, 11, 12))
colnames(patient.metadata) <- c('Sample', 'Grade', 'Cancer_type', 'IHC_subtype', 'Treatment')

patient.metadata$Sample <- gsub('-', '', patient.metadata$Sample) %>%
  paste0('CID', .)

patients.results <- rownames_to_column(as.data.frame(deconvolution.results.dwls), 'Sample') %>%
  gather(., key='celltype', value='cell_fraction', -Sample) %>%
  left_join(., patient.metadata)
## Joining with `by = join_by(Sample)`

Each condition has a different number of samples:

as.data.frame(table(patient.metadata$IHC_subtype, dnn = list("IHC_subtype")), responseName = "number_patients")
##   IHC_subtype number_patients
## 1         ER+              12
## 2       HER2+               3
## 3   HER2+/ER+               2
## 4        TNBC               9

We can visualize the estimated cell fractions across samples with a boxplot, and group the results either by IHC subtype or by treatment status

ggplot(patients.results, aes(x=celltype, y=cell_fraction, fill=IHC_subtype))+
  geom_boxplot()+
  ylab("cell fractions") +
  xlab("") +
  scale_fill_discrete(name="IHC subtype") +
  theme_bw() +
  theme(axis.text.x= element_text(angle=60, hjust = 1 )) 

HER2-positive samples are enriched in T cells, while the ER-positive and triple negative breast cancer (TNBC) samples show an enrichment in cancer and normal epithelial cells, respectively. The patient metadata includes information about eventual treatments undergone by the patients. We can see that 5 out of the 24 patients were treated with Neoadjuvant and/or Paclitaxel, while the other 19 were untreated. We can therefore visualize the distribution of cell fractions across treated and untreated patients

ggplot(patients.results, aes(x=celltype, y=cell_fraction, fill=Treatment))+
  geom_boxplot()+
  ylab("cell fractions") +
  xlab("") +
  scale_fill_discrete(name="Treatment") +
  theme_bw() +
  theme(axis.text.x= element_text(angle=60, hjust = 1 )) 

The major differences in this case are the enrichment of the CAFs and normal epithelial cells in the treated samples, as opposed to the cancer epithelial cells and B cells which have a higher median cell fraction in the untreated samples.

As mentioned before, with the lower level of annotations we can identify additional cell types, such as CD4+/CD8+ subtypes, Fibroblasts and cancer cells. We can therefore visualize the estimated cell fractions in a boxplot, focusing in particular on the CD4+ and CD8+ T-cell subtypes.

patients.results.minor <- rownames_to_column(as.data.frame(deconvolution.results.dwls.minor), 'Sample') %>%
  gather(., key='celltype', value='cell_fraction', -Sample) %>%
  left_join(., patient.metadata)
## Joining with `by = join_by(Sample)`
patients.results.minor %>%
  filter(celltype %in% c('T cells CD4+', 'T cells CD8+')) %>%
ggplot(., aes(x=celltype, y=cell_fraction, fill=IHC_subtype))+
  geom_boxplot()+
  ylab("cell fractions") +
  xlab("") +
  scale_fill_discrete(name="IHC subtype") +
  theme_bw() +
  theme(axis.text.x= element_text(angle=60, hjust = 1 ))

The estimates for the CD8+ T cells seem to be very low for all samples. On the other hand, in the HER2/ER positive samples the cell fractions estimated for the CD4+ T cells seem to be significantly higher than in the other subtypes. The estimates for the CD8+ T cells are close to zero for almost every sample, with very low fractions detected for the HER2+/ER+ ones.

Similarly, we can visualize the distribution of the estimates for the cancer associated fibroblasts (CAFs), respectively the myoblastic CAFs (myCAF) and the bone marrow derived inflammatory CAFs (MSC iCAF).

patients.results.minor %>%
  filter(celltype %in% c('CAFs myCAF-like', 'CAFs MSC iCAF-like')) %>%
ggplot(., aes(x=celltype, y=cell_fraction, fill=IHC_subtype))+
  geom_boxplot()+
  ylab("cell fractions") +
  xlab("") +
  scale_fill_discrete(name="IHC subtype") +
  theme_bw() +
  theme(axis.text.x= element_text(angle=60, hjust = 1 ))

Here we can notice that, while the myoblastic CAFs have a comparable median value, the MSC CAFs have higher infiltration in the samples characterized as HER2+/ER+ subtype.

Finally, we can visualize the distribution of the different cancer cells molecular subtypes that were described by the authors.

patients.results.minor %>%
  filter(celltype %in% c('Cancer Basal SC', 'Cancer LumA SC', 'Cancer LumB SC', 'Cancer Her2 SC')) %>%
ggplot(., aes(x=celltype, y=cell_fraction, fill=IHC_subtype))+
  geom_boxplot()+
  ylab("cell fractions") +
  xlab("") +
  scale_fill_discrete(name="IHC subtype") +
  theme_bw() +
  theme(axis.text.x= element_text(angle=60, hjust = 1 )) +
  labs(title = 'Cancer cells estimated fractions')

We can see that the HER2+ shows an overrepresentation of the cancer cells described as HER2-like by the authors. Similarly, the ER+ samples show higher fractions of Luminal B cancer cells. These findings on bulk RNA-seq are concordant with the subtypings that the authors described on the single cell RNA-seq (‘Results - scSubtype’).

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.